Context-Dependent Multilingual Lexical Lookup for Under-Resourced Languages
نویسندگان
چکیده
Current approaches for word sense disambiguation and translation selection typically require lexical resources or large bilingual corpora with rich information fields and annotations, which are often infeasible for under-resourced languages. We extract translation context knowledge from a bilingual comparable corpora of a richer-resourced language pair, and inject it into a multilingual lexicon. The multilingual lexicon can then be used to perform context-dependent lexical lookup on texts of any language, including under-resourced ones. Evaluations on a prototype lookup tool, trained on a English–Malay bilingual Wikipedia corpus, show a precision score of 0.65 (baseline 0.55) and mean reciprocal rank score of 0.81 (baseline 0.771). Based on the early encouraging results, the context-dependent lexical lookup tool may be developed further into an intelligent reading aid, to help users grasp the gist of a second or foreign language text.
منابع مشابه
Lexicon+TX: rapid construction of a multilingual lexicon with under-resourced languages
Most efforts at automatically creating multilingual lexicons require input lexical resources with rich content (e.g. semantic networks, domain codes, semantic categories) or large corpora. Such material is often unavailable and difficult to construct for under-resourced languages. In some cases, particularly for some ethnic languages, even unannotated corpora are still in the process of collect...
متن کاملImproving Under-Resourced Language ASR Through Latent Subword Unit Space Discovery
Development of state-of-the-art automatic speech recognition (ASR) systems requires acoustic resources (i.e., transcribed speech) as well as lexical resources (i.e., phonetic lexicons). It has been shown that acoustic and lexical resource constraints can be overcome by first training an acoustic model that captures acoustic-to-multilingual phone relationships on languageindependent data; and th...
متن کاملLow Cost Construction of a Multilingual Lexicon from Bilingual Lists
Manually constructing multilingual translation lexicons can be very costly, both in terms of time and human effort. Although there have been many efforts at (semi-)automatically merging bilingual machine readable dictionaries to produce a multilingual lexicon, most of these approaches place quite specific requirements on the input bilingual resources. Unfortunately, not all bilingual dictionari...
متن کاملPronunciation Lexicon Development for Under-Resourced Languages Using Automatically Derived Subword Units: A Case Study on Scottish Gaelic
Developing a phonetic lexicon for a language requires linguistic knowledge as well as human effort, which may not be available, particularly for under-resourced languages. To avoid the need for the linguistic knowledge, acoustic information can be used to automatically obtain the subword units and the associated pronunciations. Towards that, the present paper investigates the potential of a rec...
متن کاملLanguage Identification for Under-Resourced Languages in the Basque Context
Automatic Speech Recognition (ASR) is a broad research area that absorbs many efforts from the research community. The interest on Multilingual Systems arouses in the Basque Country because there are three official languages (Basque, Spanish, and French), and there is much linguistic interaction among them, even if Basque has very different roots than the other two languages. The development of...
متن کامل